%%%%%% Document general properties %%%%%%

\documentclass[a4paper,12pt]{article}
\usepackage{fixltx2e}[2006/03/24]  % package fixing some LaTeX bugs
\usepackage[american]{babel}
\usepackage[a4paper, dvips=true, scale=0.86, nofoot, centering]{geometry}
\usepackage[utf8]{inputenc}
%\usepackage[T1]{fontenc}
\usepackage{microtype}
\usepackage{color}

%%%%%% Sciences addons  packages %%%%%
%\usepackage{achemso}
\usepackage{amsmath}
\usepackage{amssymb}
%\usepackage{amsbsy}
%\usepackage{chemarr}
%\usepackage{chemarrow}




%%%%%%%%  Graphic packages  %%%%%%%%%%%%
\usepackage[dvips]{graphicx}   % final mode
%\usepackage[draft,pdftex]{graphicx}   % draft mode
%\usepackage[dvips]{color}
%\usepackage{psfrag} 



%%%%%%%%%%% Fonts packages %%%%%%%%%%%%%%
%\usepackage{euler}
%\usepackage{mathrsfs}
%\usepackage{times}
%\renewcommand{\ttdefault}{txtt}
%\usepackage{frmath}
%\usepackage{marvosym}
\usepackage{pxfonts} % way better for reading PDF on screen


%%%%%%%%%%%  float/tables packages %%%%%%%%%%%%%
%\usepackage{multirow}
%\usepackage{multicol}
%\usepackage{arydshln}
%\usepackage{float}
%\usepackage{nonfloat}
%\usepackage{placeins}
%\usepackage[ruled,rm,sc,bf]{caption}
%\usepackage{afterpage}


%%%%%%%%%% Additional Packages %%%%%%%%%%
\usepackage{hyperref}
\usepackage{breakurl}

%\newcommand{\upd}{\genfrac{}{}{0pt}{0}{\nearrow}{\searrow}}
%\newcommand{\ffrac}[2]{\genfrac{}{}{0pt}{0}{#1}{#2}}


%%%%%%%%%%%%%%  Presentation properties %%%%%%%%%%

\pagestyle{headings}
%\sloppy
%\fussy
%\renewcommand{\baselinestretch}{2}
\setlength{\columnsep}{5mm}
\setcounter{tocdepth}{5}
\usepackage{indentfirst}

%\renewcommand{\thefootnote}{\alph{footnote}}
%\makeatletter
%\renewcommand{\@makefnmark}{\textsuperscript{\texttt \thefootnote}}
%\makeatother


\renewcommand{\descriptionlabel}[1]{$\longrightarrow$ '\texttt{#1}' :}
\newcommand{\note}[1]{\textcolor{red}{#1}}

%\floatstyle{ruled}
%\newfloat{scheme}{ht}{lsch}
%\floatname{scheme}{\textsc{Scheme}}
%\restylefloat{figure}

%%%%%%%%%%%%%% Bibliography  %%%%%%%%%%%%%%%%%%%%%

%\usepackage{footbib}
%\footbibliography{biblio}
%\footbibliographystyle{unsrt}
%\usepackage{overcite}


%%%%%%%%%%%%%%%%%%%%%%% Document %%%%%%%%%%%%%%%%%%%%%%%%%%%

\title{Description of the pencil-code HDF5 files \texttt{v0.1}}

\author{}


\hypersetup{%
  pdftitle = {Description of the pencil-code HDF5 files},
  pdfauthor = {Rapha\"el Plasson} ,
  pdfcreator = {\LaTeX\ with package \flqq hyperref\frqq} 
}


\begin{document}


\maketitle
\thispagestyle{empty}

%\tableofcontents

%\listoffigures

\section{Data structure}
\label{sec:data-structure}

In the following description, the full pathway for both subgroups and
datasets is given for better readability. The subgroups are written
with a trailing '$\slash$' to be distinguished from the datasets. The
attributes relative to groups and datasets are written without the
pathway. The different elements are presented in this order: after
each subgroup follow its different attributes, then its datasets,
then its embedded subgroups.


Please note that this is a first working version (thus the
\texttt{v0.1}...), intended to be a proposal to be discussed. It
should be considered as experimental, and thus shouldn't be expected
to be stable (i.e.\  its format may evolve a lot).


\textbf{Important :} Please update this file immediately when you
update the pencil-code HDF5 format, so that this description is always
correctly describing the current implementation.

\note{Notes and questions to be discussed are written in red. I
  suggest to increase the version number of the HDF5 file to
  \texttt{1.0} once we will have agreed and get rid of all these
  notes.}

\begin{figure}[p] \centering
  % \fbox{
  \parbox{18cm}{\tiny %Yes, it's small, the purpose is to try to
                            %have the whole structure on one single
                            %page... Well anyone can zoom on pdf files :)
    \begin{description}
    \item[/]  Root of the file
      \begin{description}
      \item[name] String Attribute. Set to ``PencilCode''. This attribute
        is intended to describe the type of data stocked there,
        i.e.\  pencil code data. Any informations concerning the run itself
    are intended to be stocked in the respective subgroups.
  \item[ver] String Attribute. Version of the
    pencil-code HDF5 file. The format described in this file
    corresponds to the version \texttt{v0.1}.
  \item[WorkDir] Directory of the Pencil Code run. It is expended to
    the full pathname from the entered working directory.
  \item[dateC] String Attribute. Date of creation of the
    file. All dates are formatted as a string typeset as ISO 8601 dates in
    the format 'YYYY-MM-DDTHH:mm:ss.SSS', using the python
    \texttt{datatime} module.
  \item[dateM] String Attribute. Date of last modification of the
    file.
  \item[precision] String Attribute. Set up there as is can be set up
    independently of the format used in Pencil Code data (typically
    used to force the HDF file data to be converted from double to
    single, for reducing the file size) 
  \item[/param/] Subgroup intended to stock all the parameters data
    relative to the runs. \note{I assume that all the relevant
      parameters can be found in \texttt{params.log} in datafile. The interest
      of reading the parameters there is that we have not only all the
      parameters obtained during init, but also before \emph{each} run,
      so that all parameters changes made after each run can be
      found.}
    \begin{description}
    \item[/param/dim/] Subgroup containing the parameters available in
      \texttt{dim.dat} files. The first index of the data corresponds
      to the different files (0: global file, 1:pirst proc, 2:second
      proc, etc.)
      \begin{description}
      \item[/param/dim/nbproc] Integer dataset of dimension
        (1,). Number of proc.
      \item[/param/dim/mx] Integer dataset of dimension (nbproc,1) of
        the mx parameters in \texttt{data/dim.dat},
        \texttt{data/proc0/dim.dat}, etc... 
      \item[/param/dim/...] etc.
      \end{description}
    \item[/param/init/] Subgroup containing the parameters
      passed during the \texttt{init} phase.
      \begin{description}
      \item[/param/init/init\_pars/]
        Subgroup of the parameters of
        \texttt{init\_pars}. \note{Please note that all the subgroups
          and parameters name are written in lowercase to avoid any
          potential confusion.}
        \begin{description}
          \item[/param/init/init\_pars/cvsid] String dataset of
            dimension (1,), non-resizable
          \item[/param/init/init\_pars/ip] Integer dataset of
            dimension (1,), non-resizable
          \item[/param/init/init\_pars/xyz0] Float dataset of
            dimension (3,), non-resizable
          \item[/param/init/init\_pars/...] All the parameters will be
            added. \note{It has been chosen to list each parameter in
            separate datasets rather than in a single large dataset
            array, because it gather different types (and sometimes
            even arrays of data)... but maybe this should be discussed,
            and a single array would be better ?}
        \end{description}
      \item[/param/init/hydro\_init\_pars/] Subgroup of the parameters
        of \texttt{hydro\_init\_pars}
        \begin{description}
        \item[/param/init/hydro\_init\_pars/inituu] String dataset of
            dimension (1,), non-resizable
        \item[/param/init/hydro\_init\_pars/...] etc.
        \end{description}
      \item[/param/init/...\_init\_pars/] etc. \note{Do you think that we
        should create the subgroups in all cases, or only in the cases
        there are referred to in the pencil code files (i.e. when they
        are really needed for the run). In the present implementation,
        I chose to only add the relevant subgroups, rather than add
        empty one.}
      \end{description}
    \item[/param/run/] Subgroup containing the parameters
      passed during each \texttt{run} phase. in all the arrays, the
      first dimension corresponds to the number of the run.
      \begin{description}
      \item[/param/run/timerun] Float dataset of
        dimension (\texttt{nbrun},1), first dimension resizeable, \texttt{nbrun} being
        the number of run that were realized. Note that this
        parameter do note appear as a pencil code parameter, but is
        written in the log file each time a new run is launched as a
        continuation from preceding runs.
      \item[/param/run/run\_pars/] Subgroup of the parameters of
        \texttt{run\_pars} 
        \begin{description}
        \item[/param/run/run\_pars/nt] Integer dataset of
            dimension (\texttt{nbrun},1), first dimension resizable
        \item[/param/init/run\_pars/...] etc.
        \end{description}
      \item[/param/run/chemistry\_run\_pars/] Subgroup of the
        parameters of \texttt{chemistry\_run\_pars} 
        \begin{description}
        \item[/param/run/chemistry\_run\_pars/mobility] Float dataset of
            dimension (\texttt{nbrun},\texttt{nbchem}), first
            dimension resizable. \texttt{nbchem} is the number of
            chemical species.
        \item[/param/init/chemistry\_run\_pars/...] etc.
      \item[/param/run/...\_run\_pars/] etc.
        \end{description}
      \end{description}
      \item[/param/index/] Subgroup containing the parameters from
        \texttt{index.pro}. \note{This is a fast implementation, as it seem that
        some of these parameters are needed for plotting \texttt{VAR}
        files, but I must look into it more thoroughly (there are,
        e.g. several parameters present several time in this file...)}
    \end{description}
  \item[/data/] Subgroup intended to stock all the data resulting from
    the runs.
    \begin{description}
    \item[/data/time\_series\_names] String dataset of the name of the
      time series column.
    \item[/data/time\_series] Float dataset of the time series, of
      dimension (timeslices, nbvar),  \texttt{timeslices} corresponds
      to an output every \texttt{/param/run/run\_pars/it1} time
      steps, and \texttt{nbvar} corresponds to the number of
      diagnostic variables entered in the file \texttt{print.in}
      \item[/data/slices\_names] String dataset of the variables that
        were plotted in slices, as described in \texttt{video.in}, of
        dimension (nbslices,)
      \item[/data/slices\_time] Float dataset of the snapshot time of each
        slice snapshot, of dimension (islice,)  corresponding to an output
        every \texttt{/param/run/run\_pars/dvid} unit of time
      \item[/data/slices\_xy] Float dataset of the slices, of dimension
        (islice,nbslices,x,y),  \texttt{x,y} are the grid dimension in the chosen slice
        available in \texttt{/param/dim/nx} and \texttt{/param/dim/ny}.
      \item[/data/slices\_xy2] 
      \item[/data/slices\_xz]
      \item[/data/slices\_yz] id. for corresponding slice
    \item[/data/var] Float dataset of the snapshots, of dimension
      (timeslices3,x,y,z), \texttt{timeslices3} corresponds to an
      output every  \texttt{/param/run/run\_pars/dsnap} unit of time,
      and \texttt{x,y,z} is the dimension of the whole grid
    \end{description}
  \item[/etc/] Subgroup intended to stock some additional
    informations about the runs.
    \begin{description}
    \item[/etc/notes] Dataset of Variable Length Strings. Dimension:
      resizable (1,). This dataset is intended to stock written notes
      about the run. Each note can be of arbitrarily long size, and a
      arbitrary number of notes can be added. More precisely, each
      note is in HDF5 VL string format (VL standing for Variable
      Length), and the dataset is a resizable array of VL strings, of
      dimension one, initially created with one (empty) string
      element.
    \item[/etc/ext/] Subgroup intended to stock some external datasets,
      e.g. from experimental data.
    \item[/etc/...] This subgroup should be left open to any personal
      modifications, so that any additional datasets or subgroups can
      be entered there as wanted.
    \end{description}
  \end{description}
\end{description}
}%}
\caption{Data structure of a pencil-code HDF5 file}
\label{fig:data-struct}
\end{figure}
\section{Example of use}
\label{sec:example-use}

All the following example are given in python, using the module
\texttt{h5py}. This module was chose rather than the other
implementation (\texttt{python-tables}) for being simple of use
through the high-level component, while at the same time offering
access  to the almost entire HDF5 C API through the low-level
components. There is nothing python-specific, so that the created
HDF5 files will (should...) be readable by any other HDF5
implementation, independently of the chosen language. Note that for
the moment, direct access to the HDF5 must be done (using high-level
compounds of h5py module), but I intend to write a python interface to
access and modify these data. 

All the following example are supposed to be written on the python
command line, after having imported the HDF5 module by:
\begin{verbatim}
>>> import h5py
>>>
\end{verbatim}
The ``\texttt{>>>}'' corresponds to the python prompt, and thus indicate
the commands typed by the user. The lines without the prompt are the
answers returned by python.
In the following examples, we will assume that pylab and numpy were
also imported as:
\begin{verbatim}
>>> import pylab as P
>>> import numpy as N
>>>
\end{verbatim}

\subsection{Accessing a file}
\label{sec:opening-file}

A file may be open by the command \texttt{h5py.File}. Different
informations can be obtained from the returned object.
\begin{verbatim}
>>> f=h5py.File("datafile.hdf5","a")
>>> f
<HDF5 file "datafile.hdf5" (mode a, 3 root members)>
>>> f.attrs
<Attributes of HDF5 object "/" (4)>
>>> f.keys()
['data', 'notes', 'param']
>>>
\end{verbatim}

Be careful, all the modifications that are made to a file are
buffered. It is necessary to either flush or close the file if one
wants to be sure that everything has been written on the disk.
\begin{verbatim}
>>> f.flush()
>>> f.close()
>>>
\end{verbatim}

\subsection{Reading the file attribute}
\label{sec:read-file-attr}

Attributes of the file can be directly read, so that we can check that
this is a pencil code data file, which version of file format it is,
and when it was created and last modified:
\begin{verbatim}
>>> f.attrs["name"]
'PencilCode'
>>>> f.attrs["ver"]
'v0.1'
>>> f.attrs["dateC"]
'28/09/09'
>>> f.attrs["dateM"]
'28/09/09'
>>>
\end{verbatim}
Note that one can easily modify these attributes, or to create to
one. It is advisable to not do so !

\subsection{Slices}
\label{sec:slices}

One may easily have access to any slice:
\begin{verbatim}
>>> f['data/slices_names'][...]
array(['lnrho', 'b2', 'rhop'], 
      dtype='|S5')
>>>
\end{verbatim}
These are the three varaibales that were snapshot in slices.
\begin{verbatim}
>>> rhop=f['data/slices_yz'][:,2,:,:]
array(['lnrho', 'b2', 'rhop'], 
      dtype='|S5')
>>> rhop.shape
(9, 16, 64)
>>>
\end{verbatim}
We can copy the slice corresponding to \texttt{rhop} (i.e. variable of
index 2) and one can get rhop as an array, here of 9 slices of
dimension 16x64.

A slice can be directly plotted by:
\begin{verbatim}
>>> P.imshow(f['data/slices_yz'][5,2,:,:])
>>> P.imshow(f['data/slices_xy'][:,1,:,16])
>>>
\end{verbatim}
The first one plots the fifth yz-slice of variable 2. The second one
plots the segment y=16 of the xy-slices of variable 1 as a function of
time. 


\subsection{notes}
\label{sec:notes}

Initially only \texttt{f['notes/note'][0]} exists, and it is
set to the empty string \texttt{''}:
\begin{verbatim}
>>> f['/notes/note'][0]
''
>>> f['/notes/note'][1]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  (... snip here verbose error messages ...)
ValueError: Index (1) out of range (0-0)
>>>
\end{verbatim}

Any data can be entered in the note, and additional notes can be created:
\begin{verbatim}
>>> f['notes/note'][0]="This is a test file"
>>> f['notes/note'].resize((2,))
>>> f['notes/note'][1]="This is an additional note"
>>> f['notes/note'][1]
'This is an additional note'
>>> f['notes/note'][...]
array([This is a test file, This is an additional note], dtype=object)
>>> 
\end{verbatim}

Object can be used to directly access to the notes:
\begin{verbatim}
>>> note=f['notes/note']
>>> note[0]
'This is a test file'
>>> note
<HDF5 dataset "note": shape (2,), type "|O4">
>>> 
\end{verbatim}

\section{Discussion}
\label{sec:discussion}

\subsection{SVN version}
\label{sec:svn-version}



We should have some identifier for the version of the code that created
the data.
`\texttt{svnversion}' can provide this, but it would have to be called
from \emph{pc\_run}, not from the \emph{h2py} script.
Unfortunately, \texttt{svnversion} can sometimes take 5 or 10
seconds\ldots
\emph{\quad--~wdobler}

What can be easily done may be to parse the file
\texttt{data/svn.id}, and write all the info in, for example a
\texttt{SVN} subgroup (in an array containing all the files with their
svn version number), or even more simply to read all the svn version
numbers of this file, take the larger one, and write it in an attribute
of the root. What do you think ? Is this file updated after every
build ?
\emph{\quad--~rplasson}


\subsection{Filters}
\label{sec:filters}

We may add options about the possible filters. There should be either
the possibility of manual tuning, or fastest (null=no compression,
data in order), fast (shuffle+lzf=compression with optimization of
access time) or small (shuffle+gzip=best compression available). See
benchmark on \url{http://h5py.alfven.org/lzf/}.
\emph{\quad--~rplasson}


\end{document}
